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Abstract 

C/3 ■ 

Supervised statistical classification is a vital tool for satellite im- 
. age processing. It is useful not only when a discrete result, such as 

feature extraction or surface type, is required, but also for continuum 
retrievals by dividing the quantity of interest into discrete ranges. Be- 
cause of the high resolution of modern satellite instruments and be- 
cause of the requirement for real-time processing, any algorithm has 
to be fast to be useful. Here we describe an algorithm based on kernel 
estimation called Adaptive Gaussian Filtering that incorporates sev- 
eral innovations to produce superior efficiency as compared to three 
other popular methods: A;- nearest-neighbour (KNN), Learning Vector 
Quantization (LVQ) and Support Vector Machines (SVM). This effi- 
ciency is gained with no compromises: accuracy is maintained, while 
estimates of the conditional probabilities are returned. These are use- 
ful not only to gauge the accuracy of an estimate in the absence of its 
true value, but also to re-calibrate a retrieved image and as a proxy 
for a discretized continuum variable. The algorithm is demonstrated 
and compared with the other three on a pair of synthetic test classes 
and to map the waterways of the Netherlands. Software may be found 



at: http : / /libagf . sourcef orge . net 
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1 Introduction 



Remote-sensing satellite instruments are returning increasing amounts of in- 
formation about the Earth's surface and atmospheric state. To be useful, 
these data need to be processed to retrieve the quantities of interest, ide- 
ally in real time or better. Surface-detecting instruments, such as Landsat, 
MODIS and AVHRR in the visible and infra-red and AMSR-E and SSM/I 
in the microwave, are vital tools for mappin g the globe, especial ly when it is 



constantly shifting as in the case of sea ice fISpreen et all |2008| ). Statistical 



classification can help determine what the instrument is "seeing," whether 
that be crops, water, road or ice, underneath a given pixel or from a com- 
bination of pixels; for instanc e, picking out crops of a particular type in 



a Landsat image (jLaud . |2004| ). This type of study can be useful both for 
producing maps and gathering statistics. Even when a discrete type is not 
required, statistical classification can still be useful for continuum re trieval 
and i nversion by dividing the quantity of interest into discrete ranges (IMillsl . 
2009h . 

In supervised statistical classification, we are interested in determining 
the class, c of a test vector, x, based on a series of known input :output 
relations, {afj : q}, also known as training data. The vector, x, could corre- 
spond to a measurement vector, i.e. counts from several channels of one or 
more satellite instruments, while the class, c might correspond to a surface 
type, e.g. crops, forest, field, road, water, etc. The class is related to the 
inputs via a conditional probability, P{j\x), which is discretely represented 
by the training data. Normally, we seek the the most likely class (maximum 
likelihood estimation): 

c = argmaxP(j|x) (1) 
j 

thus we need some method of estimating the conditional probabilitie s. This 
is the function of a kernel- density esti mator (ITerrell and Scottl . 119921 ) and of 
a fc-nearest-neighbours (KNN) method (IMichie a/.Ul994l). O ther methods, 
such as Learning Vector Quantisation ( LVQ) ( iKohonenl . |2000| ) and Support 



Vector Machines (SVM) (jMiiller et all l200l[ ) skip this step in favour of di 



rectly determining the class domains. Note that kernel estimation should 
not be confused with methods based on the "kernel trick" such as SVM, 
described in section 14.2.31 

Because of the large amount of data involved and the necessity for real- 
time processing, an important feature of modern satellite inversion algo- 
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rithms, including statistical classification, is speed. In ImIUs fl2nn9[ ). a method 
for statistical classification called "Adaptive Gaussian Filtering" (AGF), based 
on kernel estimation, was briefly described and applied to Advanced Mi- 
crowave Sounding Unit (AMSU) data to discretely retrieve water vapour in 
the upper troposphere. Because this retrieval required several days of AMSU 
swath data, comprising tens of millions of individual measurements, the clas- 
sification algorithm had to be fast. Here we describe the algorithm more com- 
pletely and demonstrate its superior performance compared to three other 
popular methods by first applying them to an artificial test case and then 
using them to classify surface types in Landsat images. 

The AGF algorithm incorporates several critical innovations that make 
it extremely efficient without sacrificing accuracy or the ability to estimate 
the conditional probabilities. These are needed to set a definite confidence 
on the accuracy of an estimate in absence of knowledge of its true value. 
The following refinement s are applied to a varia ble-bandwidth, kernel den- 
sity "balloon" estimator (ITerrell and Scottl . 119921 ) based on Gaussian kernels. 
First, the filter width is matched to the sample density using the properties 
of the exponential function, avoiding unnecessary computation of exponen- 
tials (section l3.ip . Second, calculations are restricted to a set of fc- nearest- 
neighbours found in nlogk time with a binary tree (section 13. 2p . Third, 
because probability estimates are continuous, a pre-trained model can be 
generated by searching for the class- borders with guaranteed, super-linear 
convergence (sections 12.21 and 13.31) . Fourth, using the pre-trained model, the 
conditional probabilities are interpolated from gradients at the class-border 
(section 12. 3p . 



2 Essential description of the algorithm 



2.1 Adaptive Gaussian filtering 

The /c-nearest-neighbours (KNN) is a well-known and effective technique of 
estimating probability densities and performing classifications. It works by 
picking from the training data the k sample s near est the test point and 



determining the class by voting (IMichie et all Il994j ). A simple refinement 



to this method would be to weight the samples according to distance as in 
a simple linear filter. Given a set of points, {xi}, the probability density 
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function (PDF) of a test point, x, may be estimated as follows: 

« ^ (2) 

n 

W = (3) 

1=0 

where Wi is the weight of the ith sample, n is the number of samples and 
is a normalisation coefficie nt. Its iustif i cation is a Monte Carlo integration 



with importance sampling ( iPress et ali Il992l ). except that here we solve for 



the importance distribution by multiplying it with a peaked, but otherwise 
essentially arbitrary function, and assume that it is roughly constant. 

The magnitude of the weights must decrease with distance and in the 
isotropic case they are given by: 

Wi = f{di) (4) 
di — — |x 1 

where / is a filter function and di is the distance of the ith sample from 
the test point. The upright brackets denote a metric, typically Cartesian, 
although in theory any metric could be used. In practise, it is often simplest 
and most efficient to first re- scale or otherwise transform the variables so 
that a Cartesian metric is then appropriate. The normalisation coefficient 
will be given by: 

N = [ fi\x-ym (6) 



where A is the domain of the sampl es. This technique is kn own as a fixed- 
bandwidth kernel-density estimator (iTerrell and Scottl . ll992h . 



A natural choice for / would be a Gaussian: 

m = exp (7) 

where a is the filter width and = (27r)^/^cr'^ with D as the number of 
dimensions. Using a fixed filter width may mean that in regions of low 
density, all samples will fall in the tails of the filter with very low weighting, 
while regions of high density will find an excessive number of samples in the 
central region with weighting close to unity. Thus we vary the filter width 
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according to density so that an optimal number of samples falls within the 
central region. 

According to the definition of the probability density, the local average 
point spacing will be given as follows: 

^ = r pLvD (8) 
[nF[x)\ 

We wish to vary the filter width so that it matches the point spacing: 

aiopt) = = (9) 

where ki is a coefficient. Substituting the approximated PDF from ([2]) 
through ([7]) produces the following that must be solved for o"*-"^*-': 

Thus, correctly selecting a fixed value for W (call it Wc) in (E]) should produce 
an "optimal" (as we have defined it) filter width. The proof is valid for all 
functions for which oc which is true for all functions of the form f{r/a) 
for which the integral in exists. 

The quantity Wc may be thought of as roughly equivalent to /c in a KNN 
scheme. Here we have created what is known as a variable-bandwidth kernel- 
density "balloon" estimator: variable-bandwidth means that the filter width 
is varied depending on the location in the sample space, while "balloon" 
means that it is varied based on the location of the test point, not the training 
samples (as in a "pointwise estimator"), thus for a given estimate the same 



filter width is applied to all training samples f lTerrell and Scottl . |1992[ ) . 

To use the method for classification, we first estimate the conditional 
probability as follows: 

P(m\x) J2 (^^) 

i,Ci=m 

where q is the class associated with the ith sample. 



2.2 Finding the class borders 



The chief advantage of this scheme over a KNN is that it produces results 
that are both continuous and different iable. Both properties are desirable in 
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that they allow us to search for a unique border between the classes (discrim- 
ination border) and hence make more rapid classifications. Assuming that 
there are only two classes, the difference in their conditional probabilities is: 

R{x) = P{2\x) - P{l\x) ^ ^ 5^(2q - 3)^. (12) 

i 

where 1 and 2 are the classes and q is the class of the ith sample. The 
borders are found by setting this expression to zero, R{x) = 0. With an 
adaptive Gaussian filter, the derivative becomes: 



dR i [opt) 



'"'j -^i ^ ,o {opt) 



dx, ^ {a^opt)Yw^Z_. ^ ^ ^ ^ ^ . . ^ 

(13) 

where Xij is the jth coordinate of the ith sample, while Xj is the jth coor- 
dinate of the test point. Derivatives will be necessary both for estimating 
the class of the test point (see below) and then extrapolating the conditional 
probabilities (section [2.3p as well as useful (though not essential) for searching 
out the discrimination border (section l3.3p . 

The border may be sampled as many times as necessary, giving a set of 
vectors, {h}, along with their corresponding gradients, {V^-R(6j)}. The class 
of a test point, x, is calculated as follows: 

j = argmin|6j — x| (14) 

i 

P = {x-h,)-WsR{h,) (15) 
c = (3+p/b|)/2 (16) 

where c is the class. The specific procedure used to sample the class border 
will be described in section 13.31 

Note that it is easy to generalise a two-class classification to multiple 
classes, although the best method of doing so will be highly problem-dependent. 



2.3 Extrapolating the conditional probabilities 

The value of R may be extrapolated to the test point. Consider a pair of 
one-dimensional classes composed of two equal-sized Gaussians of width s 
separated by a distance 2a with the class border lying at h. Let R be the 
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difference between the conditional probabilities: 



Simplifying: 



R{x) = P{2\x) - P{l\x) (17) 

= ^p-->-^;^--) (18) 

P(l,x) + P(2,x) ^ ' 

(x-b+a)^ (x-b-a)'^ 

e + e 



R{x) = tanh ( ^^^^ ) (20) 



To approximate the difference in conditional probabilities, R{x), between an 
arbitrary pair of (well-behaved) 1-D classes, we fit R{x) to R{x) by setting: 

R{b) = Rib) = (21) 

dR(b) a ^ dR(b) 
dx dx 

We can approximate R for a pair of mult i- dimensional classes in the same 
way: 

R{x) ^ R{x) = tanhp (23) 

It is easy to show: 

R{bj) = (24) 
V^Rib,) = WsRih,) (25) 

The approximation assumes that the PDFs of the classes are roughly Gaus- 
sian near the discrimination border. 



3 Refinements 

3.1 Solving for the filter width 

The properties of the exponential function can be used to solve for the filter 
width by iteratively squaring the weights. Let the jth weighting coefficient 
of the ith sample be given as: 
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where the j'th filter width. Each subsequent iterate is defined as the 

square of its previous: 

= (27) 



consequently the jth filter width, a^^\ obeys the following recursion relation: 



,(i) 

with the final iteration of j, call it /, defined such that: 



Following our super- scripting convention, W^^"^ is defined as: W^^'^ = ^ • w- 



W^^'^ < (29) 

Obviously, the initial filter width, a^^^ , must be chosen to be larger than the 
optimal, for instance by taking the total variance of the data. 

The final filter width is approximated by exponential interpolation to the 
target total weight: 

' 2{\ogW(f) -logWif-^)) ^ ' 

The advantage of this scheme is that the exponentials, the most expensive 
part of the calculation, are computed only twice for each training sample. 
This is in contrast to a root-finding algorithm, which would need three com- 
putations at minimum. 



3.2 Restricting calculations to /c-nearest-neighbours 

Although the filter function, /, is applied in theory to all the samples, in 
practice only a subset of nearest neighbours will make a significant contri- 
bution to the final result. We can find k nearest neighbours in n\ogk time 
using a binary tree. The procedure is illustrated in figured] and described in 
words in the following paragraph. 

All the distance must be calculated and the k least of these will correspond 
to the desired neighbours. The first /c + 1 elements are arranged in a binary 
tree. The largest element will be the rightmost in the tree and must be 
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0. b. 




Figure 1: Demonstrating the use of a binary tree to select the 10 least ele- 
ments from a list of distances, {di}. In each illustration, elements to the left 
are smaller, while elements to the right are larger. In (a), the tree has been 
filled with the first eleven members from the list and the largest element in 
the tree is marked for removal. In (b), the operation is completed and the 
12th member from the list is added. In (c), the largest element in the tree is 
once again marked for removal. In (d), the operation is completed and the 
13th member from the list is added. 
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deleted. This can be done in roughly log k time by traversing the tree from 
its root. A new element is then added to the tree, also in log/c time, and the 
largest once again deleted. The procedure is repeated until all the elments 
in the li st have been added to the tree and the only ones remaining are the 
k least flKnuthl - EoOsh . 

While it might appear that repeatedly deleting the largest element will 
produce an unbalanced tree, actual tests suggest that this is not the case. 
There are at least two mitigating factors. First, the greatest element is not 
always a "leaf" or lowermost element, but is itself often a node, thus an entire 
sub-tree will take its place as figures [T]^c) and (d) exemplify. Second, new 
elements are constantly being added to the tree. These will tend to be larger 
on average than those already occupying it since we are selecting for the k 
least. 



3.3 Root-finding 

To sample the class border, the following procedure was employed: pick two 
points at random, xl and a^, belonging to classes 1 and 2 respectively. We 
define u as a direction vector between the two points, e.g.: 

V = X2 — Xi (32) 

and solve the following for t: 

R{xi + vt) = (33) 

to find a point, b = Xi + vt\ji=o, randomly located on the class border. 

This reduces the D-dimensional root-finding problem to only one dimen- 
sion, with the root already bracketed, thus it can be found with considerable 
certainty. The root of a one-dimensional function is considered bracketed 
when we have two values of the independent variable for which the func- 
tion evaluates to o pposite signs, therefore at least one root must lie between 



( jPress et all Il992l ). The derivatives, dR/dt = V^R ■ v, are used as an aid to 
increase the speed of convergence. If both the value and the first derivative 
of the function are known at two locations, then it is possible to fit a unique, 
third-order polynomial. The root is estimated by equating the polynomial 
to zero, and the true root re-bracketed with this new estimate - see figure 
[21 The procedure is repeated until the true root is found to within a certain 
tolerance. This combines the fast convergence of a Newton's method with 
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Figure 2: Demonstration of the root-finding algoritlim. In a., the root is 
bracketed for the first time. A third-order polynomial is fitted and the root 
approximated by solving the cubic equation. In b., the function has been re- 
bracketed with the new root and a new polynomial fitted between the new 
brackets. 
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Table 1 : Table of values used to define the "spine" of the distribution defining 
the second of the two synthetic test classes. 



X 


y 


0.17 


0.79 


0.36 


0.70 


0.51 


0.84 


0.70 


0.86 


0.76 


0.68 


0.77 


0.48 


0.68 


0.32 


0.46 


0.28 


0.24 


0.26 



the numerical stability of a bisection algorithm (jPress et aUll992l ). hence we 
term the method, "supernewton." The GNU scientific library (GSL) is used 
both to solve the cubic and to fit the functi on by solving the ran k four linear 
system using Householder transformations ( Galassi et al . 20071 ). 



4 Comparison with competing algorithms us- 
ing an artificial dataset 

4.1 A pair of test classes 

The AGF method was tested on a synthetic dataset consisting of the pair 
of classes shown in figure O The first class, illustrated using triangles, is a 
simple, two-dimensional normal distribution: 



P(f|l) 



1 



27r (71(72 



exp 



y_ 

0-2 



X 



COS 6 sin 6 
— sin 6 cos 6 



[X 



Xo) 



(34) 
(35) 



where is the centre of the distribution, ai and (72 are the widths of its major 
and minor axes respectively, 6 is the angle of the major axis and x' = [x', y'). 
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To produce a non-trivial interface between the two classes, the second one 
has a more intricate design. A set of points defining the "spine" of the distri- 
bution was chosen - see Table [TJ Individual samples are first in terpolated a 



random distance along th e curve so defined using a cubic spline flPress et al. 



19921 : iGalassi et ali 120071 ). By displacing this point a random distance in a 
random direction, the final location of the sample is determined. 

Analytic or semi-analytic (e.g. using numerical quadradure to perform 
the integration) values for the probability densities of the second class may 
be calculated as follows: 

P{x\2) = / Q{\x - g{s)\)d^ (36) 

^max Jo 



'max 



where g is the line segment defining the spine, s is the path along it, 
is its length and Q{r) is a circularly symmetric PDF governing the offset 
distance from the backbone. For the dataset shown in figure [31 a Gaussian 
(of width 0.1) was once again employed for Q. 



4.2 Competing algorithms 

To test the effectiveness of the AGF classification algorithm, it was compared 
with three other, popular methods. These are: /c- nearest-neighbours (KNN), 
Learning Vector Quantisation (LVQ) and Support Vector Machines (SVM) 
and are briefly described below. 



4.2.1 Ji-nearest-neighbours 

The KNN is one of the simplest and most robust classification methods avail- 
able. It consists of finding the k training samples that are closest to the test 
point and then counting how many there are of each class. The class with 
the most samples is the class of the test point. 

Our method reduces to a KNN by using a filter function that is one out 
to the filter width and zero everywhere else: 

where a is the filter width. 
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4.2.2 Learning Vector Quantisation 

LVQ works by building up a set of "codebook vectors" whose density matches 
the difference between the densities of the two classes. The vectors will be 
labelled based on which class they fall within. The class of a test point will 
be given by the class label of the nearest codebook vector. 

The training process is performed as follows: a set of codebook vectors are 
initialised at random points and assigned class labels. The codebook vectors 
are randomly compared with the training samples. If the labels of the two 
match, then the codebook vector is moved closer to t he training sample. If 



they don't, then it is moved further away( lKohonenl . l2000l : iKohonen et al. 



19951). 



4.2.3 Support Vector Machines 

In SVM, a single hyperplane is drawn that best divides the two classes. 
Classifications are done as in equations (fT4|) - (fT6|l based on a dot product 
with the normal to the class border. The border is fitted by minimizing the 
classification error. Obviously, using a single straight line to divide the two 
synthetic sample classes will produce poor results. Results may be improved 
by adding variables derived from the originals thus expanding the dimension 
of the space, much as one fits a nonlinear function, e.g. a polynomial, by 
performing a linear fit on the set of variables formed by transforming the 
independent variables with a set of basis functions. 

The so-called kernel trick can be applied to most algorithms that use 
scalar products and is based on the observation that certain mathemati- 
cal operations applied to a scalar product will expand the variables in a 
set of basis functions, without having to explicitly calculate them. For ex- 
ample, consider the squared dot product of two two-dimensional variables 



(IMiiller et a/.l . l200lh : 



x-y) = [{xuX2) ■ {yi,y2)] (38) 
= xjxl + 2x1^/1X2^/2 + xlyl (39) 



^2x1X2, xl) ■ {yl V2yiy2, yl) (40) 



4.3 Comparison results 

Random synthetic datasets composed of 5000 samples of the first class and 
10000 samples of the second class were created as needed using the algorithms 
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Table 2: Comparison of parameters used in the KNN, LVQ, AGF and AGF 
borders classification algorithms 



KNN 


LVQ 


AGF 


AGF borders 


k = 101 


ao = 0.1 
rit = 75000 
n = 1000 


Wc = 100 
k = 1000 


Wc = 100 
k = 1000 
n = 250 
e = 0.0001 



Table 3: Parameters used in the SVM classification algorithm 



implement at ion : 


LIBSVM 


type: 


C-SVC 


kernel basis function: 


exp(7 - x^-p) 


r- 


0.5 


C (cost): 


100 


e: 


0.001 



described in section 14.11 Test datasets composed of three-thousand (3000) 
members were created separately and had no fixed ratio between the number 
of members in each class. Rather the ratio of the training data was used to 
randomly select which class was sampled. 

The algorithm was also compared with an analytical classification scheme 
that compares the results of equations flM|) and fl36|l applied to the test 
point. This has the advantage of quantifying the limit in accuracy of any 
classification algorithm, as well as returning the conditional probabilities. 

Parameters for the KNN, LVQ, AGF and AGF borders techniques are 
compared in Table O For the LVQ method, Kohonen's so-called optimised 
LVQ 1 (OLVQl) method was used. Ut is the number of training cycles, while 
n is the number of codebook vectors for the LVQ method and the number of 
border samples for the AGF border classification method. Note that AGF 
can be applied directly without searching for the class borders using equation 
(ITT]) . Also, performing the AGF classifications with all the data would be 
too slow, so the /c-nearest-neighbours supplying the most weight are selected 
before applying the algorithm. This is done in n log k time using a binary tree 
as described in section 13. 2[ The parameters used for the SVM method are 
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Table 4: Summary of validation and comparison results 



Algorithm 


training 


classification 


uncertainty 


accuracy 


correlation 




time (s) 


time (s) 


coefficient 




of R 


Analytic 


N/A 


1.29 ±0.20 


0.53 ±0.02 


0.906 ±0.005 


1. 


KNN 


N/A 


5.59 ±0.31 


0.53 ±0.02 


0.905 ±0.005 


0.9953 ±0.0006 


AGF 


N/A 


9.40 ±0.31 


0.53 ±0.02 


0.905 ±0.005 


0.9979 ±0.0003 


AGF borders 


4.19 ±0.10 


0.013 ±0.002 


0.53 ±0.02 


0.905 ±0.005 


0.9972 ± 0.0008 


LVQ 


2.81 ±0.01 


0.101 ±0.003 


0.50 ±0.02 


0.898 ±0.006 


N/A 


SVM 


112.4 ±3.6 


2.22 ±0.15 


0.53 ±0.02 


0.905 ±0.005 


0.9978 ± 0.0003 



listed in Table |3l The parameter e represents the fitting tolerance for both 
AGF and SVM. Parameters were hand-selected to maximize efficiency with- 
out compromising accuracy using the cross-validation procedure described 
above. 

Table H] shows the results of the comparison over twenty (20) trials. The 
uncertainty coefficient is a better method of validating classification results 
than simple accuracy (fraction of correct guesses) because it is not affected 
by the relative number of samples in each class. It is defined as follows: 



Hm = -5^P(2,j)lnP(^|j) 
H{i) = ^P(2)lnP(i) 

i 

H{t)-H{t\j) 



Hii) 



(41) 
(42) 

(43) 



where i and j enumerate the true and retrieved classes respectively, P{i,j) is 
their joint probability, P{i\j) = P{i,j)/P{j) is their conditional probability, 
and P{i) = ^jP{hi) is the total, invariant probability of the first class. 
If we think of the classification procedure as a noisy channel, then U{i\j) 
quantifies how many bits of knowledge we have of the true value of the class 
as a fraction of the maximum number of bits it is po s sible to transmit per 



classification (jPress et all Il992l : [Shannon and Weaverl . Il963l ) 



The last column in the table is simply the correlation coefficient of the 
estimates of R{x) vs. the true values as computed by equations (l34l) - (!36|l . 
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Conditional probabilities of KNN 




R(x) analytic 
r= 1.00 



Figure 4: Comparison between estimates of conditional probabilities for the 
synthetic test classes. F-axis is KNN, X-axis is semi-analytic. 
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Conditional probobilitios of AGF 




R(x) analytic 
r= 1.00 



Figure 5: Comparison between estimates of conditional probabilities for the 
synthetic test classes. F-axis is AGF without borders training, X-axis is 
semi- analytic. 
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Conditional probobilitios of AGF borders 




R(x) analytic 
r= 1.00 



Figure 6: Comparison between estimates of conditional probabilities for the 
synthetic test classes. F-axis is AGF with borders training, X-axis is semi- 
analytic. 



Int. J. Remote Sens. 32(21): 6109-6132 



21 



Conditional probobilltles of SVM 




R(x) analytic 
r= 1.00 

Figure 7: Comparison between estimates of conditional probabilities for the 
synthetic test classes. F-axis is LIBSVM, X-axis is semi-analytic. 



Int. J. Remote Sens. 32(21): 6109-6132 



22 



Condi;ional probabilities of AGF 




Figure 8: Comparison between estimates of conditional probabilities for the 
synthetic test classes. F-axis is AGF with borders training, X-axis is AGF 
without. 
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Class borders: LVQ vs. explicitly sampled 
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Figure 9: The class border of the synthetic test classes as characterized by 
three different methods. 



For a visual comparison, see figures HHTl Figure [8] compares estimates from 
AGF with borders training and without. 

The main thing to note from this comparison is how much faster AGF 
with borders training is than the other four methods. This is important if 
we need to process large amounts of satellite data, especially in real time. 
The only other method that even comes close is LVQ, especially in the train- 
ing phase, where it is much faster. Unfortunately, it does not supply any 
knowledge of the conditional probabilities, a necessary quantity for measur- 
ing the accuracy of a given classification with no prior knowledge of its true 
value. They are also useful for re-calibration of retrieved images, as will be 
demonstrated later. 

Another short-coming of the LVQ method is that it samples very sparsely 
near the class borders, the exact region where we require the most knowledge. 
In fact the density of codebook vectors approaches zero at the class border; 




True 
AGF 



LVQ 
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the method is actually designed this way! When we plot the class border 
between the codebook vectors from an LVQ training run, this produces a 
line that is jagged and meandering. Contrast this to the border found via 
AGF as shown in figure 14.31 

5 Application to Landsat images 

Because of its global coverage and high resolution, Landsat 7 is one of the 
most sophisticated surface-mapping instruments. It images the globe us- 
ing seven channels in the visible and near-infrared. Because of the 30 m 
resolution, any kind of processing will be highly computationally intensive. 
Landsat images are frequently used "as-is" to simulate aerial photographs by 
simply forming a colour image from three of the channels. A more powerful 
use, however, is feature-detection or surface classification using automated 
discrimination algorithms such as statistical classification. 

The waterways in the Netherlands form a dense, complex network and 
because they produce dry land where once there was open sea are one of 
the engineering wonders of the world. Statistical classification was used to 
map the lakes, canals and rivers of the Netherlands based on a Landsat 
7 image as seen in figure [TOl Training data was selected manually from 
three images: two from Southern Ontario - LE70170292008280EDC00 and 
LE70180302008207EDC00 - whose surface is one third fresh water and one 
from Northern Germany - LE71960232008206ASN00 - whose landscape re- 
sembles the Netherlands. 1822 training samples were selected in total while 
the six Landsat 7 channels with 30 m resolution were used in the analysis. 

Training times for AGF, LVQ, SVM and SVM with probability estimates 
were 0.5 s, 0.7 s, 0.15 s and 0.7 s, respectively. Classification times were 5.5 
minutes, 20 minutes, 55 minutes and 58 minutes, respectively. In this ex- 
ample the training time is fairly immaterial because of the small number of 
samples, however for classification, AGF is nevertheless still the clear winner. 
200 border samples were used. For LVQ, 200 "codebook" vectors were em- 
ployed, although as implied in section 14. 3^ this is giving LVQ an unfair speed 
advantage since it will take more "codebook" vectors than border samples 
to represent the discrimination border to a similar level of precision. 

Figure [11] demonstrates the utility of having the conditional probabilities 
available. Using a different set of lower-quality training data, many water 
points are now mis-classified as land. Therefore, we re-calibrate the algorithm 




Figure 10: Landsat image (LE71980242002267EDC00) of the S. Netherlands 
whose pixels have been automatically classified into land and water using a 
training dataset derived from three other Landsat images. 




Figure 11: Landsat image (LE71980232007217ASN00) of the N. Netherlands 
whose pixels have been automatically classified into land and water. The hor- 
izontal striping is caused by a malfunction in the instrument control system 
affecting all Landsat 7 images after 31 May 2003. 




Figure 12: Landsat image (LE71980232007217ASN00) of the N. Netherlands 
whose pixels have been automatically classified into land and water. Image 
has been re-calibrated by shifting the discrimination border. 
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Comparison of conditional probability esiinnotes 




-1.0 



-1.0 -0.5 0.0 0.5 1.0 

R (AGF) 
r= 1.00 

Figure 13: Comparison of conditional probability estimates using AGF with 
borders training versus direct AGF of pixels classified into land and water 
from Landsat 7 image LE71980242002267EDC00. Contours follow a geomet- 
ric progression. 
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by choosing a different, lower threshold value, R = —0.8, for the discrimi- 
nation border, similar to what is described in iMillsl (120091 ) . The corrected 
image is shown in figure [T2l It is still not perfect, but the transformation has 



done a good job of re-classifying many of the points lying in the tidal fiats 
as water instead of land. 

In real-world problems, if the PDF's of the classes are not monotonic 
and roughly Gaussian near the borders, then conditional probability esti- 
mates based on (!23|) will be inaccurate. Fortunately, this is rarely the case. 
Figure [13] compares conditional probabilities estimated by extrapolation us- 
ing equation (!23|) and more directly using the right-most side of equation 
f ll2p for the surface classifications illustrated in figure [TUJ In libSVM, con- 
ditional probabihtiesare^^ estimated using a one- dimensional parametri- 
sation fjChang and Linl . l200ll ). so the best way to ensure accurate estimates 
is to apply one of the more direct methods - KNN or AGF without borders 
training. 



6 Discussion 

6.1 Algorithmic efficiency 

The speed of the competing algorithms for a single problem does not tell the 
whole story since we also need to know the algorithmic efficiency, that is how 
the speed of the different algorithms depends on both the training set size 
and on the selection of parameters. In the case of KNN, classification times 
depend on the time it takes to select the k neighbours, in our case n log k time. 
Since k is generally increased with increasing numbers of training samples, 
this means that the time dependence is more like nlogn. Actual tests of the 
selection algorithm, however, show a very weak dependence on k. Best case 
performance for a selection algorithm such as ours tha t sorts the re sults is 



actually n + k log k, while for one that does not it is n ( iKnuthl . |1998[ ). 

If this step is included, the bottleneck in the AGF algorithm is selecting 
the k nearest neighbours, hence the time efficiency will also be roughly n log k 
worst case, although with further overhead of order k from calculating the 
weights. If the selection step is skipped, the time efficiency will be n, but 
with a large coefficient. Time efficiency for LVQ training will be n, while for 
SVM training, it is since the solution of a matrix equation is required. 
Note that all methods will have extra overhead from reading in the training 
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data and also from writing the output. 

For algorithms that have a separate training phase, there is also the issue 
of the classification time. For both LVQ and AGF borders, classification 
times will be independent of the number of training samples. Time efficiency 
will rather depend linearly on the number of samples used to represent the 
discrimination border - codebook vectors for LVQ and border samples for 
AGF - which are both adjustable parameters. The larger these parameters, 
the greater the accuracy, although with diminishing returns. For equivalent 
accuracy, more LVQ codebook vectors than AGF border samples will be 
needed. 

In the case of SVM, classification times do depend on the number of train- 
ing samples, which would seem to make the training phase a bit redundant. 
The time efficiency appears to be a bit better than n, where n is the number 
of training samples. 



6.2 Use for continuum retrieval 

Within the context of satellite remote sensing, statistical classification would 
appear to be a somewhat specialized tool, useful chiefly for processing images 
by extracting features or classifying surface types. The method truly comes 
into its own, however, when recognised as an efficient, general, non-linear 
inverse method. 

Like a neural network, statistical classification generates a direct inverse 
method that has the potential to be very fast. Unlike a neural network, 
however, there is no possibility of getting stuck in a local minimum when 
trying optimise the model. Accuracy of the discrimination border is limited 
only by the resolution of the samples. The presence of l ocal minima ma y also 



confound inverse methods, such as optimal estimation ( iRodgerd . |2000| ). that 
use a forward model directly. 

Also unlike a neural network, statistical classification generates only dis- 
crete values: a limitation which is surprisingly easy to overcome. A con- 
tinuum variable can be retrieved by dividing it into ranges. Designate the 
continuum variable as q. Just like in the discrete case, the relationship be- 
tween the measurement variable, x, and the state variable, q, will be governed 
by a conditional probability, P{q\x). Suppose we divide q into two ranges 
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0.0001 0.0010 0.0100 

specific humidity 

Figure 14: Comparison of conditional probability estimates with continuum 
values for a discrete water-vapour retrieval (threshold value for two ranges 
is set at 0.001 mass-mixing ratio). Conditional probabilities are estimated 
directly using AGF without borders training. 



Int. J. Remote Sens. 32(21): 6109-6132 



32 



with a threshold value, go? so that the classes are defined as follows: 

2- It* («) 

The continuum conditional probability is transformed to the discrete condi- 
tional probability by integration: 

/go 
P{q\x)dq (45) 
-oo 
/■oo 

P(2|f) = / P{q\x)dq (46) 
J go 

If the divisions in q are made fine enough, i.e. smaller than the estimated 
error, then these can be used directly. Alternatively, the continuum value can 
be reconstituted in a number of other ways. For a two-dimensional retrieval. 



we ca n retrieve a series of isolines and then interpolate between them (IMillsl . 



20091 ). If conditional probabilities are available, they can actually make quite 
a good proxy for the continuum result. Figure \\M compares conditional prob- 
abilit i es wit h continuum values from the water-vapour retrieval described in 



Mills! ( 120091 ). It is easy to show from equations fHSl) and P6|l that if the 
statistics of q for a given x in P{q\x) are Gaussian, then the relationship 
between the two will be an error function, as seen in the figure. Given: 

P(g|f) = exp I -M^IuM! k (47) 



271 a, ^ I 2crg 

where aq is the width of the distribution and q is the expectation value. 
Then: 

R = erf (48) 

This result can be used not only to estimate the continuum result, but also 
to replace the hyperbolic tangent form used in equation ( l23l) to estimate R. 
This assumes that q is roughly linear in x near the class border. 

Another feature of such a retrieval is its robustness. Once classes have 
been defined, it is impossible to generate a result outside of this range. More- 
over, it is easy to see from ( H5|) and (H6|) that classification retrieval of contin- 
uum values is what is known as a robust estimator. Normally, q is estimated 
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by evaluating its expectation value or first moment: 

/oo 
P{q\x)qdq (49) 
-oo 

By contrast, the location of go is found by setting: 

i>qo i>QO 

P{q\x)dq= / P{q\x)dq (50) 



90 



or equalizing the fraction of "zeroth order" moment on each side of the thresh- 
old. This type of formulation is characteristic of robust estimators and re- 
sults in outhers in the training data having less effect on the final model 



f lPress et al\ . Il992[ ) 



Finally, with the continuum variable divided into multiple ranges, re- 
turned error statistics can be made more detailed. By examining the con- 
ditional probabilities within each range, not only the statistics, but also the 
approximate functional form of the error may be determined. 



7 Conclusion 

A statistical classification algorithm based on kernel-density estimation was 
described, which we term Adaptive Gaussian Filtering or AGF. The many 
refinements of this algorithm produce impressive speed gains compared to 
other methods making it appropriate for processing large amounts of satel- 
lite data, especially in real-time. As applied to a synthetic test dataset, the 
method is shown to be 25 times as fast for training and 125 times as fast 
for classification, when compared to LIBSVM, a Support Vector Machine 
implementation. The performance advantage becomes greater the larger the 
number of training samples. Therefore, when tested on real data, discrimi- 
nating land from water in the Netherlands, the performance advantage was 
reduced because of the smaller training dataset - training times were roughly 
equivalent, while classification was ten times as fast. 

The algorithm was also compared to /c-nearest-neighbours (KNN) and 
Kohonen's Learning Vector Quantisation (LVQ). Three steps can be dis- 
tinguished in the AGF algorithm. First, the conditional probabilities are 
estimated using kernel averaging. These estimates can be used directly to 
classify a set of test data from the training data, in a manner equivalent to a 
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KNN. Second, the conditional probability estimates are used to search for the 
discrimination or class border in the case of a two-class classification. Once 
the discrimination border is known, it can be used to generate classification 
estimates more quickly than directly with the training data. Thus, KNN is 
not as fast as AGF when borders training is included. 

While LVQ is similarly fast, both for training and classification, it does 
not return estimates of the conditional probabilities and is less accurate. 
Estimates of the conditional probabilities are useful for determining the ac- 
curacy of an estimate in absence of its true value. They are also useful for 
recalibrating an image when the classification estimates are biased. 

Conditional probabilities are estimated using a simple parametrisation 
based on their gradients at the border. For the most accurate estimates of 
the conditional probabilities, a direct method such as KNN or AGF without 
borders training can be used. For most applications, however, the gains in 
accuracy will be too small to offset the significant speed penalty. 

While statistical classification finds broad application in image processing 
techniques like feature extraction and surface detection, its full power as 
a general, non-linear inverse method has yet to be harnessed. Continuum 
variables can be retrieved by dividing them into discrete ranges. Such a 
technique has the advantage over a neural network or a more direct inverse 
method such as optimal estimation in that there is no possibility of becoming 
stuck in a local minimum. The accuracy of the estimates is limited only by 
the number and resolution of the training samples. 

Software can be found at: ,http : //libagf . sourcef orge .net 
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